[Judges] rlhflow pairwise judges #2548

kashif · 2025-01-07T17:02:00Z

What does this PR do?

Adds support for an RLHFlow-based pairwise judge.

HuggingFaceDocBuilderDev · 2025-01-07T17:10:49Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

qgallouedec · 2025-01-07T17:14:24Z

trl/trainer/judges.py

+        for prompt, completion_pair in zip(batch_prompts, batch_completions):
+            # Convert prompt to chat format
+            instruction = [{"role": "user", "content": prompt}]
+            context = self.tokenizer_plain.apply_chat_template(instruction, tokenize=False)


I recommend using trl.apply_chat_template from the trl data utils here. We've encountered several issues in the past when applying chat templates to partial sequences, and this approach would be more robust.

While one could argue that we control the chat template in this context, using trl.apply_chat_template ensures that any future modifications to the chat template won't introduce unexpected issues here.

Both this and the change below follow how the RLHFlow model recommends doing the scoring. I can check whether it works using the chat template.
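To make the concern about partial sequences concrete, here is a minimal sketch of the prefix-based idea: render the prompt alone, render the full conversation, and take the completion as the remaining suffix. It does not call TRL. `render`, `split_prompt_completion`, and the `<eos>` marker are hypothetical stand-ins for a real tokenizer's chat template, and whether `trl.apply_chat_template` does exactly this should be checked against its source.

```python
EOS = "<eos>"  # hypothetical end-of-turn marker for the toy template


def render(messages, add_generation_prompt=False):
    """Toy chat template: each message becomes '<role>\\ncontent<eos>\\n'."""
    text = "".join(f"<{m['role']}>\n{m['content']}{EOS}\n" for m in messages)
    if add_generation_prompt:
        # Open the assistant turn so a model would continue from here.
        text += "<assistant>\n"
    return text


def split_prompt_completion(prompt_msgs, completion_msgs):
    """Render the prompt and the full conversation, then slice off the completion."""
    prompt = render(prompt_msgs, add_generation_prompt=True)
    full = render(prompt_msgs + completion_msgs)
    if not full.startswith(prompt):
        raise ValueError("prompt is not a prefix of the full rendering")
    return prompt, full[len(prompt):]


prompt_msgs = [{"role": "user", "content": "What is 2+2?"}]
completion_msgs = [{"role": "assistant", "content": "4"}]

prompt, completion = split_prompt_completion(prompt_msgs, completion_msgs)

# Naive approach: template the prompt, then append the raw completion string.
naive = render(prompt_msgs) + "4"
```

In this toy setup the naive string lacks both the assistant header and the end-of-turn marker, whereas `prompt + completion` reproduces the full rendering exactly.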

qgallouedec · 2025-01-07T17:15:16Z

trl/trainer/judges.py

+
+        self.tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
+        self.tokenizer_plain = AutoTokenizer.from_pretrained(model_name, use_fast=True)
+        self.tokenizer_plain.chat_template = "\n{% for message in messages %}{% if loop.index0 % 2 == 0 %}\n\n<turn> user\n {{ message['content'] }}{% endif %}{% endfor %}"


Why do you need to override the chat template btw?
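For reference, the logic of the overridden template in the diff above can be mirrored in plain Python. This is only an approximation: `render_plain` is a hypothetical helper, and the exact newlines a real Jinja environment produces depend on its whitespace-control settings (for example `trim_blocks`), so the whitespace here should not be taken as what the tokenizer actually emits.

```python
def render_plain(messages):
    """Approximate the overridden template's control flow.

    Only even-indexed messages are emitted, each as a '<turn> user' block;
    the template has no else branch, so odd-indexed messages are skipped.
    """
    out = "\n"
    for index, message in enumerate(messages):
        if index % 2 == 0:
            out += "\n\n<turn> user\n " + message["content"]
    return out


single = render_plain([{"role": "user", "content": "Hi"}])

multi = render_plain(
    [
        {"role": "user", "content": "Hi"},
        {"role": "assistant", "content": "Hello!"},
        {"role": "user", "content": "Bye"},
    ]
)
```

With a single user message, as in the judge code under review, nothing is lost. With a multi-turn conversation, the assistant message would not appear in the rendered context at all.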

qgallouedec · 2025-01-07T17:34:53Z

From their demo code, this is what I get as input for the model:

<|start_header_id|>user<|end_header_id|>

[CONTEXT] 

<turn> user
 Ellipsis
<turn> assistant
 Ellipsis
<turn> user
 Ellipsis
 [RESPONSE A] BBBB [RESPONSE B] CCCC<|eot_id|>

This doesn't make much sense to me:

Numerous unnecessary whitespace characters.
Why <|start_header_id|>user<|end_header_id|>?
Why aren't the responses surrounded by \n as well?
Why <|eot_id|> if you want to generate further?

Why not something like this instead:

[CONTEXT]
<turn> user
Ellipsis
<turn> assistant
Ellipsis
<turn> user
Ellipsis

[RESPONSE A]
BBBB

[RESPONSE B]
CCCC

[BEST RESPONSE]
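The layout suggested above could be assembled with a small helper like the following. This is a sketch of the proposal only, not what the RLHFlow model was trained on; `build_pairwise_prompt` and its arguments are hypothetical names introduced for illustration.

```python
def build_pairwise_prompt(turns, response_a, response_b):
    """Build the proposed layout: context turns, both responses, then an open tag.

    `turns` is a list of (role, content) pairs, e.g. [("user", "Hi")].
    """
    lines = ["[CONTEXT]"]
    for role, content in turns:
        lines.append(f"<turn> {role}")
        lines.append(content)
    lines += [
        "",
        "[RESPONSE A]",
        response_a,
        "",
        "[RESPONSE B]",
        response_b,
        "",
        "[BEST RESPONSE]",  # left open so the model can generate the verdict
    ]
    return "\n".join(lines)


example = build_pairwise_prompt(
    turns=[("user", "Hi"), ("assistant", "Hello!"), ("user", "Tell me a joke.")],
    response_a="BBBB",
    response_b="CCCC",
)
```

Each section sits on its own line, there is no leading space before the content, and no end-of-turn token is appended, which addresses the points raised in the comment.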

kashif · 2025-01-07T17:41:15Z

you are using the instructions from here: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B right?

qgallouedec · 2025-01-07T17:42:24Z

> you are using the instructions from here: https://huggingface.co/RLHFlow/pair-preference-model-LLaMA3-8B right?

precisely

rlhflow pairwise judge

b3da24a

kashif requested a review from qgallouedec January 7, 2025 17:02

kashif added 2 commits January 7, 2025 18:03

fix helper name

41f479c

undo change

f3e36d9

qgallouedec reviewed Jan 7, 2025

kashif added 2 commits January 8, 2025 11:52

Merge branch 'main' into openrlhf-judge

a5a89f9

skip test on wondows

52cb514
